MGT 6203: Data Analytics in Business
July 24, 2022
Introduction
Motivation
The objective of this project is to use motor vehicle collision
reports compiled by the New York City Police Department (NYPD) to help a
fictitious auto repair center in New York City estimate the volume of
incoming vehicles it can expect to repair in the coming year, assuming
its market share is known. Our analysis will help this repair center
predict the staffing levels it will need to maintain and identify
potential opportunities for expansion. Additionally, we analyze the
demographics of those involved in collisions and identify the groups
that contribute most to car collisions. We propose and measure the
impact of a marketing campaign for this repair center, and we also
analyze the impact of COVID-19 on this industry sector.
Assumptions
We assume equal market share for the 710 motor vehicle body shops found
in New York City. This gives an average market share of 0.1408% per
shop, which we will use to estimate the volume of incoming vehicles a
single shop can expect.
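As a quick sanity check, the per-shop share follows directly from the equal-share assumption (a minimal sketch in R; the 710-shop count comes from the assumption above):

```r
# Equal-share assumption: 710 NYC body shops split the market evenly
n_shops <- 710
market_share <- 1 / n_shops
round(100 * market_share, 4)  # 0.1408 (percent per shop)
```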
Hypotheses
After our initial data exploration, we developed a few
hypotheses:
Primary hypotheses
- We can estimate the monthly demand for a fictitious auto shop in New
York City with reasonable accuracy by extrapolating the monthly car
crash volume.
- We can identify the demographic that comprises the largest market
segment for motor vehicle repairs after an accident; from our
preliminary investigation, we suspect that young men comprise this
demographic.
Secondary hypotheses
- We can assess the impact of advertising expenditure on the auto
shop’s bottom line.
- We can quantify the impact of COVID-19 on demand in this
sector.
Literature review
Demographics
In their research, Regev et al. (Regev,
Rolison, & Moutari) evaluated the crash risk of drivers of
different ages and genders, adjusting for travel exposure, and
found that drivers between the ages of 21 and 29 carry the highest risk
of being in a car crash. This contrasts with prior research suggesting
that teenagers and elderly drivers carry the highest risk of car
collisions, since that research did not control for the effect of
travel exposure (Regev, Rolison, &
Moutari, 131). In the team’s initial discussions, we hypothesised
that the data would show that teenagers and elderly drivers have the
highest rate of car collisions. This article broadened our perspective
and set us up to let the data guide us.
Location
He et al. (He et al.) developed a
technique to predict where accidents happen. This work could allow
policymakers to install traffic calming devices in areas predicted to
be risky. While we did not build directly on their work, we were
inspired by the advanced models they built.
COVID-19
Li and Zhao (Li & Zhao) determined
that while the overall volume of car collisions has plummeted since the
start of the pandemic, cyclist fatalities have tripled when the
quantity of traffic is controlled for. This presents an area of further
research for us. To dig deeper into the true impact of the pandemic, we
would have to collect or find data that we could merge with our
existing data sets to glean further insights. Because our time in this
course is limited, we considered this to be out of scope for this
project.
Approach
To perform this analysis, we take several steps:
Data wrangling
In this step, we read the data into memory and clean it.
We use 3 data sets:
- Data detailing the collisions,
Crashes,
- Data detailing the vehicles involved in collisions,
Vehicles, and
- Data detailing the people involved in the collisions,
Person.
After the data is read, we display a few summaries to give the reader
a sense of the scope we are dealing with. We then merge the 3 data
sets so that we can perform our analysis smoothly.
Data exploration
After the initial wrangling is done, we then explore the data by
making a few plots. One effect we reflect on is the effect of COVID-19
on collisions in New York City.
Feature selection
We use several strategies to perform analysis with the features that
have the most predictive power:
- Manual predictor selection
- LASSO
- Elastic net
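For the LASSO and elastic net strategies, the glmnet package can be used (it is not among the libraries loaded below, so this is an illustrative sketch; `x` stands for a model matrix of predictors and `y` for the monthly crash counts):

```r
library(glmnet)

# alpha = 1 fits the LASSO; 0 < alpha < 1 fits an elastic net.
lasso_fit <- cv.glmnet(x, y, alpha = 1)    # cross-validated lambda
enet_fit  <- cv.glmnet(x, y, alpha = 0.5)
coef(lasso_fit, s = "lambda.min")          # predictors that survive shrinkage
```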
Modeling
We split the data into 3 sets: training, validation, and test data
sets. We then take advantage of 5 different modeling techniques:
- Linear regression,
- Random forest,
- Classification and regression tree (CART),
- Gradient boosted trees, and
- An ensemble of all 4 models above.
For each, we visualize the output and then analyze its performance on
the validation data set. We use \(R^2\)
to analyze model performance.
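Throughout the modeling sections, \(R^2\) is computed as 1 - SSE/SST against the observed counts. A small helper capturing that calculation (a convenience sketch; the sections below compute the same quantities inline):

```r
# R^2 = 1 - SSE/SST, with SST taken about the mean of the observed values.
# On held-out data this can go negative when a model underperforms the
# naive mean predictor, as some validation results below illustrate.
r_squared <- function(actual, predicted) {
  sst <- sum((actual - mean(actual))^2)
  sse <- sum((actual - predicted)^2)
  1 - sse / sst
}
```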
Model comparison
The objective here is to summarise the models’ performance metrics so
that the best model can be identified at a glance. Note that we compare
the models’ performances on the validation data set.
Prediction
Finally, we predict the collision volume on the test data set to
estimate the chosen model’s performance. This prediction also serves as
the basis for our recommendations for the auto shop.
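Converting the predicted city-wide collision volume into shop-level demand then simply applies the assumed market share; a minimal sketch with a hypothetical monthly prediction:

```r
# Shop-level demand under the equal-market-share assumption (1/710 per shop)
predicted_collisions <- 30000              # hypothetical monthly prediction
shop_share <- 1 / 710
round(predicted_collisions * shop_share)   # ~42 vehicles per month
```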
Effect of a marketing campaign
We then perform an analysis to understand the impact of a marketing
campaign on the auto repair shop’s business.
Effect of COVID-19
We also evaluate the effect of COVID-19 on the amount of business
that the shop can get.
Conclusion
We then draw conclusions based on the results obtained above.
Useful libraries
We import necessary libraries. These libraries will be used as
follows:
| Library | Purpose |
|---|---|
| dplyr | Data manipulation |
| knitr | RMarkdown knitting |
| kableExtra | Kable tables |
| plotly | Plotting |
| lubridate | Date manipulation |
| randomForest | Building Random Forest models |
| rpart | Building CART models |
| rpart.plot | Plotting regression trees |
| xgboost | Building Boosted Tree models |
library(dplyr)
library(knitr)
library(kableExtra)
library(plotly)
library(lubridate)
library(randomForest)
library(rpart)
library(rpart.plot)
library(xgboost)
Data
Summaries
Below, we import data from 3 sources and view the raw summaries. The
data are related to each other as follows: [Figure: Data relationships]
These functions will be reused in our endeavour to read and summarise
the data at hand.
read_data <- function(path) {
data <- read.csv(path, stringsAsFactors = FALSE)
# Standardise the crash date column name and parse it as a Date
colnames(data)[colnames(data) == 'CRASH.DATE'] <- 'CRASH_DATE'
data$CRASH_DATE <- as.Date(data$CRASH_DATE, "%m/%d/%Y")
data
}
tabulate_collisions <- function(data) {
kable(t(summary(data))) %>% kable_classic(full_width = TRUE, html_font = "Cambria",
font_size = 14)
}
Crashes
crashes_df <- read_data('../Data/Crashes.csv') #1,896,229 x 29
tabulate_collisions(crashes_df)
Summary of the numeric and date columns of the Crashes data:

| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA's |
|---|---|---|---|---|---|---|---|
| CRASH_DATE | 2012-07-01 | 2014-10-28 | 2016-12-15 | 2017-01-01 | 2019-01-03 | 2022-05-27 | 0 |
| ZIP.CODE | 10000 | 10306 | 11207 | 10837 | 11237 | 11697 | 587485 |
| LATITUDE | 0.00 | 40.67 | 40.72 | 40.64 | 40.77 | 43.34 | 219988 |
| LONGITUDE | -201.36 | -73.98 | -73.93 | -73.77 | -73.87 | 0.00 | 219988 |
| NUMBER.OF.PERSONS.INJURED | 0 | 0 | 0 | 0.2872 | 0 | 43 | 18 |
| NUMBER.OF.PERSONS.KILLED | 0 | 0 | 0 | 0.001357 | 0 | 8 | 31 |
| NUMBER.OF.PEDESTRIANS.INJURED | 0 | 0 | 0 | 0.05304 | 0 | 27 | 0 |
| NUMBER.OF.PEDESTRIANS.KILLED | 0 | 0 | 0 | 0.000697 | 0 | 6 | 0 |
| NUMBER.OF.CYCLIST.INJURED | 0 | 0 | 0 | 0.02433 | 0 | 4 | 0 |
| NUMBER.OF.CYCLIST.KILLED | 0 | 0 | 0 | 0.0001002 | 0 | 2 | 0 |
| NUMBER.OF.MOTORIST.INJURED | 0 | 0 | 0 | 0.2082 | 0 | 43 | 0 |
| NUMBER.OF.MOTORIST.KILLED | 0 | 0 | 0 | 0.00055 | 0 | 5 | 0 |
| COLLISION_ID | 22 | 3046533 | 3583981 | 3020876 | 4058140 | 4532390 | 0 |

The remaining columns (CRASH.TIME, BOROUGH, LOCATION, ON.STREET.NAME,
CROSS.STREET.NAME, OFF.STREET.NAME, CONTRIBUTING.FACTOR.VEHICLE.1–5,
and VEHICLE.TYPE.CODE.1–5) are character columns of length 1,895,581.
Person
person_df <- read_data('../Data/Person.csv') #4,692,054 x 21
tabulate_collisions(person_df)
Summary of the numeric and date columns of the Person data:

| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA's |
|---|---|---|---|---|---|---|---|
| UNIQUE_ID | 10922 | 6811622 | 8995010 | 8530078 | 10214090 | 12236543 | 0 |
| COLLISION_ID | 37 | 3638730 | 3921535 | 3852979 | 4210243 | 4532390 | 0 |
| CRASH_DATE | 2012-07-01 | 2017-03-19 | 2018-06-07 | 2018-07-07 | 2019-09-20 | 2022-05-27 | 0 |
| VEHICLE_ID | 123423 | 17465986 | 18528303 | 18252666 | 19124376 | 20228127 | 185601 |
| PERSON_AGE | -999.0 | 23.0 | 35.0 | 36.8 | 50.0 | 9999.0 | 452863 |

The remaining columns (CRASH_TIME, PERSON_ID, PERSON_TYPE,
PERSON_INJURY, EJECTION, EMOTIONAL_STATUS, BODILY_INJURY,
POSITION_IN_VEHICLE, SAFETY_EQUIPMENT, PED_LOCATION, PED_ACTION,
COMPLAINT, PED_ROLE, CONTRIBUTING_FACTOR_1, CONTRIBUTING_FACTOR_2, and
PERSON_SEX) are character columns of length 4,689,796.
Vehicles
vehicles_df <- read_data('../Data/Vehicles.csv') #3,704,406 x 25
tabulate_collisions(vehicles_df)
Summary of the numeric and date columns of the Vehicles data:

| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA's |
|---|---|---|---|---|---|---|---|
| UNIQUE_ID | 111711 | 14215160 | 17306058 | 16060871 | 18739205 | 20121717 | 0 |
| COLLISION_ID | 22 | 3017853 | 3567068 | 2996659 | 4028145 | 4484197 | 0 |
| CRASH_DATE | 2012-07-01 | 2014-10-15 | 2016-11-18 | 2016-11-21 | 2018-11-15 | 2021-12-04 | 0 |
| VEHICLE_YEAR | 1000 | 2008 | 2013 | 2015 | 2016 | 20063 | 1796971 |
| VEHICLE_OCCUPANTS | 0 | 1 | 1 | 1.01e+03 | 1 | 1.00e+09 | 1718406 |

The remaining columns (CRASH_TIME, VEHICLE_ID, STATE_REGISTRATION,
VEHICLE_TYPE, VEHICLE_MAKE, VEHICLE_MODEL, TRAVEL_DIRECTION,
DRIVER_SEX, DRIVER_LICENSE_STATUS, DRIVER_LICENSE_JURISDICTION,
PRE_CRASH, POINT_OF_IMPACT, VEHICLE_DAMAGE, VEHICLE_DAMAGE_1–3,
PUBLIC_PROPERTY_DAMAGE, PUBLIC_PROPERTY_DAMAGE_TYPE,
CONTRIBUTING_FACTOR_1, and CONTRIBUTING_FACTOR_2) are character
columns of length 3,704,406.
Note the large sizes of the data sets we are dealing with.
Wrangling
As a result of a traffic safety initiative to eliminate traffic
fatalities, the NYPD replaced its record management system with an
electronic one (FORMS) in 2016 (“Motor Vehicle
Collisions - Crashes”). We see this change reflected
in the chart below.
monthly_agg1 <- (person_df %>%
mutate(yr_mo = paste0(substr(CRASH_DATE,1,4),'-',substr(CRASH_DATE,6,7))) %>%
group_by(yr_mo) %>%
summarise(n = n()) %>%
arrange(yr_mo)
)
plot_ly(x=monthly_agg1$yr_mo, y=monthly_agg1$n, type='bar') %>%
layout(title = "Crash Data by Month from 2012-2022", xaxis = list(title = 'Month'), yaxis = list(title = 'Number of Crashes'))
Note that the amount of data collected from March 2016 on greatly
surpasses the amount previously collected. To control for the change in
data recording systems, we will use data collected after January 1st,
2017 for full year modeling.
We are now ready to filter the data to select columns of interest. We
will also combine data from our three sources into one place, using
their common factors. A summary can be found below.
combined_df <- (crashes_df %>%
select(-c(CRASH_DATE, CRASH.TIME, CONTRIBUTING.FACTOR.VEHICLE.1, CONTRIBUTING.FACTOR.VEHICLE.2,
CONTRIBUTING.FACTOR.VEHICLE.3, CONTRIBUTING.FACTOR.VEHICLE.4, CONTRIBUTING.FACTOR.VEHICLE.5,
VEHICLE.TYPE.CODE.4, VEHICLE.TYPE.CODE.5,
NUMBER.OF.PEDESTRIANS.INJURED, NUMBER.OF.PEDESTRIANS.KILLED,
NUMBER.OF.CYCLIST.INJURED, NUMBER.OF.CYCLIST.KILLED,
ON.STREET.NAME, CROSS.STREET.NAME, OFF.STREET.NAME)) %>%
inner_join(vehicles_df %>%
filter(!is.na(VEHICLE_ID)) %>%
select(-c(CRASH_DATE, CRASH_TIME, VEHICLE_ID))
, by = 'COLLISION_ID') %>%
inner_join(person_df %>%
select(-c(UNIQUE_ID, CONTRIBUTING_FACTOR_1, CONTRIBUTING_FACTOR_2))
, by= c('COLLISION_ID'='COLLISION_ID', 'UNIQUE_ID'='VEHICLE_ID')) %>%
filter((PED_ROLE %in% c('Driver'))) %>% #drivers only
filter((CRASH_DATE >= '2017-01-01') & (CRASH_DATE < '2021-12-01')) %>% #only from 2017-01-01 to 2021-11-30
filter(PERSON_AGE > 14 & PERSON_AGE < 101) %>%
filter(PERSON_SEX == 'F' | PERSON_SEX=='M') %>%
mutate(MONTH = substr(CRASH_DATE,6,7)) %>%
mutate(TIMESTEP = as.period(interval(as.Date('2017-01-01'), CRASH_DATE)) %/% months(1)) %>%
mutate(yr_mo = paste0(substr(CRASH_DATE,1,4),'-',substr(CRASH_DATE,6,7))) %>%
select(COLLISION_ID, CRASH_DATE, PERSON_AGE, PERSON_SEX, MONTH, TIMESTEP, yr_mo) %>%
arrange(CRASH_DATE)
)
kable(t(summary(combined_df))) %>% kable_classic(full_width = TRUE, html_font = "Cambria", font_size = 14)
Summary of the numeric and date columns of the combined data:

| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| COLLISION_ID | 3589945 | 3805149 | 4017960 | 4023043 | 4234930 | 4484186 |
| CRASH_DATE | 2017-01-01 | 2017-12-04 | 2018-11-05 | 2018-12-28 | 2019-11-01 | 2021-11-30 |
| PERSON_AGE | 15.00 | 29.00 | 40.00 | 41.69 | 53.00 | 100.00 |
| TIMESTEP | 0.00 | 11.00 | 22.00 | 23.44 | 34.00 | 58.00 |

The remaining columns (PERSON_SEX, MONTH, and yr_mo) are character
columns of length 1,339,106.
Now, our data set is ready for exploration.
Data Exploration
We create our aggregate of volume data available for modeling and
visualize it with a plot. It is interesting to note the impact of
COVID-19 on crash volume beginning in March 2020.
monthly_agg2 <- (combined_df %>%
group_by(TIMESTEP, yr_mo, MONTH) %>%
summarise(n = n(), .groups = 'drop') %>%
arrange(TIMESTEP, yr_mo, MONTH)
)
plot_ly(x=monthly_agg2$yr_mo, y=monthly_agg2$n, type='bar') %>%
layout(title = "Crash Data by Month from 2017-2021", xaxis = list(title = 'Month'), yaxis = list(title = 'Number of Crashes'))
Modeling
Splitting into training, validation, and testing data sets
We also create a coarser aggregate for modeling with feature groups
and separate our data into training, validation, and test data sets.
Since we are dealing with a temporal model, we select 2017-01 to
2019-12 for training, 2020-01 to 2020-12 for validation, and 2021-01
to 2021-11 for testing. A random split would be nonsensical for our
purposes. After the hyperparameters for each model have been tuned
using the training and validation data sets, the models are re-trained
on the combined training + validation data. This ensures that we
maximize the data available for learning while still completing the
validation steps properly.
monthly_agg3 <- (combined_df %>%
group_by(TIMESTEP, yr_mo, MONTH, PERSON_SEX, PERSON_AGE) %>%
summarise(n = n(), .groups = 'drop') %>%
arrange(TIMESTEP, yr_mo, MONTH, PERSON_SEX, PERSON_AGE)
)
#Create Training and Test data sets
train_df <- monthly_agg3 %>% filter(TIMESTEP <= 35) # 2017-01 to 2019-12
val_df <- monthly_agg3 %>% filter((TIMESTEP > 35) & (TIMESTEP <= 47)) # 2020-01 to 2020-12
test_df <- monthly_agg3 %>% filter(TIMESTEP > 47) # 2021-01 to 2021-11
#Create Train_Val data sets (final model after Train/Val hyper parameter tuning)
train_val_df <- monthly_agg3 %>% filter(TIMESTEP <= 47) # 2017-01 to 2020-12
Linear Regression
Training
#Train Linear Model
lm_model <- lm(data = train_df, formula = n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX )
summary(lm_model)
##
## Call:
## lm(formula = n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX,
## data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -463.96 -73.77 15.27 66.20 269.38
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 335.67839 6.71275 50.006 < 2e-16 ***
## TIMESTEP -0.71945 0.15283 -4.708 2.57e-06 ***
## MONTH02 -13.56568 7.36429 -1.842 0.065513 .
## MONTH03 13.23735 7.36933 1.796 0.072503 .
## MONTH04 5.12239 7.37183 0.695 0.487170
## MONTH05 35.55638 7.38663 4.814 1.52e-06 ***
## MONTH06 35.40924 7.39397 4.789 1.72e-06 ***
## MONTH07 20.93237 7.40274 2.828 0.004705 **
## MONTH08 15.48523 7.43398 2.083 0.037292 *
## MONTH09 19.43578 7.45095 2.608 0.009117 **
## MONTH10 28.85893 7.47622 3.860 0.000115 ***
## MONTH11 21.90978 7.50981 2.917 0.003542 **
## MONTH12 20.60703 7.55127 2.729 0.006373 **
## PERSON_AGE -4.45608 0.06404 -69.578 < 2e-16 ***
## PERSON_SEXM 171.44422 2.99888 57.169 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 114.2 on 5791 degrees of freedom
## Multiple R-squared: 0.5799, Adjusted R-squared: 0.5788
## F-statistic: 570.9 on 14 and 5791 DF, p-value: < 2.2e-16
SST_lm_train <- sum((train_df$n - mean(train_df$n))^2)
SSE_lm_train <- sum((train_df$n - predict(lm_model, newdata = train_df))^2)
R_sq_lm_train <- 1-(SSE_lm_train/SST_lm_train)
R_sq_lm_train #0.5799
## [1] 0.5798524
#Linear Model Validation Metrics
SST_lm_val <- sum((val_df$n - mean(val_df$n))^2)
SSE_lm_val <- sum((val_df$n - predict(lm_model, newdata = val_df))^2)
R_sq_lm_val <- 1-(SSE_lm_val/SST_lm_val)
R_sq_lm_val #-1.154703
## [1] -1.154616
#Train_Val Linear Model
lm_model_2 <- lm(data = train_val_df, formula = n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX )
summary(lm_model_2)
##
## Call:
## lm(formula = n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX,
## data = train_val_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -441.83 -64.83 4.23 58.78 299.83
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 352.82598 5.76802 61.169 < 2e-16 ***
## TIMESTEP -2.92530 0.09545 -30.646 < 2e-16 ***
## MONTH02 -10.16351 6.23069 -1.631 0.102889
## MONTH03 5.85024 6.23280 0.939 0.347955
## MONTH04 -17.30059 6.27380 -2.758 0.005837 **
## MONTH05 13.68088 6.25906 2.186 0.028863 *
## MONTH06 19.15099 6.25343 3.062 0.002203 **
## MONTH07 13.20462 6.25123 2.112 0.034691 *
## MONTH08 12.80755 6.26343 2.045 0.040908 *
## MONTH09 18.12040 6.27354 2.888 0.003883 **
## MONTH10 27.72229 6.28431 4.411 1.04e-05 ***
## MONTH11 22.61062 6.30315 3.587 0.000336 ***
## MONTH12 23.03260 6.32483 3.642 0.000273 ***
## PERSON_AGE -3.92660 0.05504 -71.344 < 2e-16 ***
## PERSON_SEXM 150.90222 2.55012 59.175 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 111.3 on 7619 degrees of freedom
## Multiple R-squared: 0.5457, Adjusted R-squared: 0.5449
## F-statistic: 653.8 on 14 and 7619 DF, p-value: < 2.2e-16
SST_lm_train_val <- sum((train_val_df$n - mean(train_val_df$n))^2)
SSE_lm_train_val <- sum((train_val_df$n - predict(lm_model_2, newdata = train_val_df))^2)
R_sq_lm_train_val <- 1-(SSE_lm_train_val/SST_lm_train_val)
R_sq_lm_train_val #0.545718
## [1] 0.5457191
Visualisation
plot_ly(x=train_val_df$yr_mo, y=(predict(lm_model_2, newdata = train_val_df)-train_val_df$n), type='scatter', mode='markers') %>%
layout(title = 'Plot of Linear Model Residuals vs yr_mo for Train_Val Data',
xaxis = list(title = 'yr_mo'),
yaxis = list(title = 'LM Residuals', rangemode = "tozero"))
Test Metrics
# Verify performance on Test dataset
SST_lm_test <- sum((test_df$n - mean(test_df$n))^2)
SSE_lm_test <- sum((test_df$n - predict(lm_model_2, newdata = test_df))^2)
R_sq_lm_test <- 1-(SSE_lm_test/SST_lm_test)
R_sq_lm_test #0.08348344
## [1] 0.08364779
Our linear model fitted on the training + validation data had an
R-squared value of 0.55, while its R-squared on the test data dropped
to 0.08.
Random forest
Training
# #Train & Val Random Forest Model - Loop - Run only once to get optimal settings
# rf_results <- data.frame()
# for(i in seq.int(1, 200)){
# set.seed(12345)
# rf_model <- randomForest(n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX, data=train_df, importance=TRUE, ntree=i) #
# summary(rf_model)
# var_importance_df <- data.frame(importance(rf_model)) %>% rename('PCT_IncMSE'='X.IncMSE') %>% arrange(desc(PCT_IncMSE))
# var_importance_df
# SST_rf_train <- sum((train_df$n - mean(train_df$n))^2)
# SSE_rf_train <- sum((train_df$n - predict(rf_model, newdata = train_df))^2)
# R_sq_rf_train <- 1-(SSE_rf_train/SST_rf_train)
# R_sq_rf_train #0.8670912
#
# #Random Forest Model Validation Metrics
# SST_rf_val <- sum((val_df$n - mean(val_df$n))^2)
# SSE_rf_val <- sum((val_df$n - predict(rf_model, newdata = val_df))^2)
# R_sq_rf_val <- 1-(SSE_rf_val/SST_rf_val)
# R_sq_rf_val #-0.5731295
#
# rf_results <- rf_results %>% bind_rows(data.frame(i=i, R_sq_rf_train=R_sq_rf_train, R_sq_rf_val=R_sq_rf_val))
# }
#
# rf_results %>% filter(rf_results$R_sq_rf_val == max(rf_results$R_sq_rf_val))
# # i R_sq_rf_train R_sq_rf_val
# # 1 41 0.8668467 -0.49019
# Best Model from Train & Val
set.seed(12345)
rf_model <- randomForest(n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX, data=train_df, importance=TRUE, ntree=41) #41
SST_rf_train <- sum((train_df$n - mean(train_df$n))^2)
SSE_rf_train <- sum((train_df$n - predict(rf_model, newdata = train_df))^2)
R_sq_rf_train <- 1-(SSE_rf_train/SST_rf_train)
R_sq_rf_train #0.8668467
## [1] 0.8503601
#Random Forest Model Validation Metrics
SST_rf_val <- sum((val_df$n - mean(val_df$n))^2)
SSE_rf_val <- sum((val_df$n - predict(rf_model, newdata = val_df))^2)
R_sq_rf_val <- 1-(SSE_rf_val/SST_rf_val)
R_sq_rf_val #-0.49019
## [1] -0.6534223
# Train_Val RF model
set.seed(12345)
rf_model_2 <- randomForest(n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX, data=train_val_df, importance=TRUE, ntree=41) #41
rf_var_importance_df <- data.frame(importance(rf_model_2)) %>% rename('PCT_IncMSE'='X.IncMSE') %>% arrange(desc(PCT_IncMSE))
SST_rf_train_val <- sum((train_val_df$n - mean(train_val_df$n))^2)
SSE_rf_train_val <- sum((train_val_df$n - predict(rf_model_2, newdata = train_val_df))^2)
R_sq_rf_train_val <- 1-(SSE_rf_train_val/SST_rf_train_val)
R_sq_rf_train_val #0.8480507
## [1] 0.8467027
Visualisation
# Make plot for Random Forest Feature Importance
plot_ly(x=rf_var_importance_df$PCT_IncMSE , y=reorder(rf_var_importance_df %>% row.names(), rf_var_importance_df$PCT_IncMSE ), type='bar', orientation = 'h') %>%
layout(title = "Random Forest Feature Importance", xaxis = list(title = '% Increase MSE'), yaxis = list(title = 'Feature'))
# Make plot for Random Forest Residuals
plot_ly(x=train_val_df$yr_mo, y=(predict(rf_model_2, newdata = train_val_df)-train_val_df$n), type='scatter', mode='markers') %>%
layout(title = 'Plot of Random Forest Model Residuals vs yr_mo for Train_Val Data',
xaxis = list(title = 'yr_mo'),
yaxis = list(title = 'RF Residuals', rangemode = "tozero"))
Test Metrics
SST_rf_test <- sum((test_df$n - mean(test_df$n))^2)
SSE_rf_test <- sum((test_df$n - predict(rf_model_2, newdata = test_df))^2)
R_sq_rf_test <- 1-(SSE_rf_test/SST_rf_test)
R_sq_rf_test #0.7298947
## [1] 0.766301
Our random forest model fitted on the training + validation data had an
R-squared value of 0.85, while its R-squared on the test data dropped
to 0.73.
Classification and Regression Tree (CART)
Training
#Train CART Model
set.seed(12345)
min_leaf <- 3
min_split <- 3*min_leaf
cart_model <- rpart(n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX, data=train_df,
control = c(minsplit = min_split, minbucket = min_leaf, cp=0.01)) #default = 0.01
SST_CART_train <- sum((train_df$n - mean(train_df$n))^2)
SSE_CART_train <- sum((train_df$n - predict(cart_model, newdata = train_df))^2)
R_sq_cart_train <- 1-(SSE_CART_train/SST_CART_train)
R_sq_cart_train #0.9473921
## [1] 0.9473918
#CART Model Validation Metrics
SST_cart_val <- sum((val_df$n - mean(val_df$n))^2)
SSE_cart_val <- sum((val_df$n - predict(cart_model, newdata = val_df))^2)
R_sq_cart_val <- 1-(SSE_cart_val/SST_cart_val)
R_sq_cart_val #-1.753216
## [1] -1.75313
# Train_Val CART model
set.seed(12345)
min_leaf <- 3
min_split <- 3*min_leaf
cart_model_2 <- rpart(n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX, data=train_val_df,
control = c(minsplit = min_split, minbucket = min_leaf, cp=0.01)) #default = 0.01
SST_CART_train_val <- sum((train_val_df$n - mean(train_val_df$n))^2)
SSE_CART_train_val <- sum((train_val_df$n - predict(cart_model_2, newdata = train_val_df))^2)
R_sq_cart_train_val <- 1-(SSE_CART_train_val/SST_CART_train_val)
R_sq_cart_train_val #0.9023342
## [1] 0.9023338
cart_plot_data <- (data.frame(Variable_Importance = cart_model_2$variable.importance,
Variable_Importance_Pct_Tot = round(100*cart_model_2$variable.importance/sum(cart_model_2$variable.importance),0)) %>%
as.data.frame())
Visualisation
# Make plot for CART Feature Importance
plot_ly(x=cart_plot_data$Variable_Importance_Pct_Tot , y=reorder(cart_plot_data %>% row.names(), cart_plot_data$Variable_Importance_Pct_Tot ), type='bar', orientation = 'h') %>%
layout(title = "CART Feature Importance", xaxis = list(title = 'Variable_Importance_Pct_Tot'), yaxis = list(title = 'Feature'))
# Make plot for CART diagram
rpart.plot(cart_model_2)
# Make plot for CART Residuals
plot_ly(x=train_val_df$yr_mo, y=(predict(cart_model_2, newdata = train_val_df)-train_val_df$n), type='scatter', mode='markers') %>%
layout(title = 'Plot of CART Model Residuals vs yr_mo for Train_Val Data',
xaxis = list(title = 'yr_mo'),
yaxis = list(title = 'CART Residuals', rangemode = "tozero"))
Test Metrics
SST_cart_test <- sum((test_df$n - mean(test_df$n))^2)
SSE_cart_test <- sum((test_df$n - predict(cart_model_2, newdata = test_df))^2)
R_sq_cart_test <- 1-(SSE_cart_test/SST_cart_test)
R_sq_cart_test #0.6526338
## [1] 0.65263
Our CART model fitted on the training + validation data had an
R-squared value of 0.90, while its R-squared on the test data dropped
to 0.65.
Extreme Gradient Boosted Tree (XGBoost)
Training
# XGB dataset feature changes (factor levels not supported, so integer casting)
train_xgb_df <- (train_df %>%
select(TIMESTEP,MONTH,PERSON_AGE,PERSON_SEX) %>%
mutate(MONTH=as.integer(as.factor(MONTH)),
PERSON_SEX=ifelse(PERSON_SEX=='M',0,1)))
val_xgb_df <- (val_df %>%
select(TIMESTEP,MONTH,PERSON_AGE,PERSON_SEX) %>%
mutate(MONTH=as.integer(as.factor(MONTH)),
PERSON_SEX=ifelse(PERSON_SEX=='M',0,1)))
train_val_xgb_df <- (train_val_df %>%
select(TIMESTEP,MONTH,PERSON_AGE,PERSON_SEX) %>%
mutate(MONTH=as.integer(as.factor(MONTH)),
PERSON_SEX=ifelse(PERSON_SEX=='M',0,1)))
test_xgb_df <- (test_df %>%
select(TIMESTEP,MONTH,PERSON_AGE,PERSON_SEX) %>%
mutate(MONTH=as.integer(as.factor(MONTH)),
PERSON_SEX=ifelse(PERSON_SEX=='M',0,1)))
# # #Train & Val Boosted Tree Model - Loop - Run only once to get optimal settings
# #Train Boosted Tree Model
# bt_results <- data.frame()
# for(i in seq.int(1, 20)){
# for(j in seq.int(3,6)){
# set.seed(12345)
# bt_model <- xgboost(data=as.matrix(train_xgb_df), label = train_df$n, objective='reg:squarederror', nthread=1, nrounds=i, max.depth=j, eta=0.05) #
# summary(bt_model)
#
# SST_bt_train <- sum((train_df$n - mean(train_df$n))^2)
# SSE_bt_train <- sum((train_df$n - predict(bt_model, newdata = as.matrix(train_xgb_df)))^2)
# R_sq_bt_train <- 1-(SSE_bt_train/SST_bt_train)
# R_sq_bt_train #0.4761988
#
# #Boosted Tree Model Validation Metrics
# SST_bt_val <- sum((val_df$n - mean(val_df$n))^2)
# SSE_bt_val <- sum((val_df$n - predict(bt_model, newdata = as.matrix(val_xgb_df)))^2)
# R_sq_bt_val <- 1-(SSE_bt_val/SST_bt_val)
# R_sq_bt_val #0.7243043
#
# bt_results <- bt_results %>% bind_rows(data.frame(n_trees=i, max_depth=j, R_sq_bt_train=R_sq_bt_train, R_sq_bt_val=R_sq_bt_val))
#
# }
# }
#
# plot_ly(x=bt_results$n_trees, y=bt_results$R_sq_bt_train, type='scatter', mode='lines+markers', color=as.factor(paste0('Train-',bt_results$max_depth))) %>%
# add_trace(x=bt_results$n_trees, y=bt_results$R_sq_bt_val, type='scatter', mode='lines+markers', color=as.factor(paste0('Val-',bt_results$max_depth))) %>%
# layout(title = 'Plot of Boosted Tree Train and Val R_Sq vs Number of Trees (depth in name)',
# xaxis = list(title = 'Number of Trees'),
# yaxis = list(title = 'R_Sq'))
#
#
# bt_results %>% filter(bt_results$R_sq_bt_val == max(bt_results$R_sq_bt_val))
# # n_trees max_depth R_sq_bt_train R_sq_bt_val
# # 12 5 0.3742168 0.7650132
# Best model from Train/Val tuning
set.seed(12345)
bt_model <- xgboost(data=as.matrix(train_xgb_df), label = train_df$n, objective='reg:squarederror', nthread=1, nrounds=12, max.depth=5, eta=0.05, verbose=0)
SST_bt_train <- sum((train_df$n - mean(train_df$n))^2)
SSE_bt_train <- sum((train_df$n - predict(bt_model, newdata = as.matrix(train_xgb_df)))^2)
R_sq_bt_train <- 1-(SSE_bt_train/SST_bt_train)
R_sq_bt_train #0.3742168
## [1] 0.374216
#Boosted Tree Model Validation Metrics
SST_bt_val <- sum((val_df$n - mean(val_df$n))^2)
SSE_bt_val <- sum((val_df$n - predict(bt_model, newdata = as.matrix(val_xgb_df)))^2)
R_sq_bt_val <- 1-(SSE_bt_val/SST_bt_val)
R_sq_bt_val #0.7650132
## [1] 0.765024
# Train_Val Boosted Tree model
set.seed(12345)
bt_model_2 <- xgboost(data=as.matrix(train_val_xgb_df), label = train_val_df$n, objective='reg:squarederror', nthread=1, nrounds=12, max.depth=5, eta=0.05, verbose=0)
xgb.plot.importance(xgb.importance(model=bt_model_2))

SST_bt_train_val <- sum((train_val_df$n - mean(train_val_df$n))^2)
SSE_bt_train_val <- sum((train_val_df$n - predict(bt_model_2, newdata = as.matrix(train_val_xgb_df)))^2)
R_sq_bt_train_val <- 1-(SSE_bt_train_val/SST_bt_train_val)
R_sq_bt_train_val #0.4057451
## [1] 0.4057425
Visualisation
plot_ly(x=train_val_df$yr_mo, y=(predict(bt_model_2, newdata = as.matrix(train_val_xgb_df))-train_val_df$n), type='scatter', mode='markers') %>%
layout(title = 'Plot of Boosted Tree Model Residuals vs yr_mo for Train_Val Data',
xaxis = list(title = 'yr_mo'),
yaxis = list(title = 'BT Residuals', rangemode = "tozero"))
Test Metrics
SST_bt_test <- sum((test_df$n - mean(test_df$n))^2)
SSE_bt_test <- sum((test_df$n - predict(bt_model_2, newdata = as.matrix(test_xgb_df)))^2)
R_sq_bt_test <- 1-(SSE_bt_test/SST_bt_test)
R_sq_bt_test #0.2349657
## [1] 0.2349213
Our XGBoost model fit on the training + validation data had an R-squared
value of 0.41, while its performance on the test data dropped the
R-squared to 0.23.
Ensemble Model
Training
# Ensemble final models for better performance
val_pred_actual <- val_df$n
val_pred_lm <- predict(lm_model_2, newdata = val_df)
val_pred_rf <- predict(rf_model_2, newdata = val_df)
val_pred_cart <- predict(cart_model_2, newdata = val_df)
val_pred_bt <- predict(bt_model_2, newdata = as.matrix(val_xgb_df))
ensemble_model <- lm(val_pred_actual ~ val_pred_lm + val_pred_rf + val_pred_cart + val_pred_bt)
summary(ensemble_model)
##
## Call:
## lm(formula = val_pred_actual ~ val_pred_lm + val_pred_rf + val_pred_cart +
## val_pred_bt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -105.626 -12.280 0.429 10.726 108.447
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.641719 1.434035 -10.210 <2e-16 ***
## val_pred_lm 0.007157 0.009049 0.791 0.429
## val_pred_rf 0.315755 0.035711 8.842 <2e-16 ***
## val_pred_cart -0.085805 0.015223 -5.637 2e-08 ***
## val_pred_bt 1.840335 0.062587 29.404 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.6 on 1823 degrees of freedom
## Multiple R-squared: 0.9131, Adjusted R-squared: 0.9129
## F-statistic: 4787 on 4 and 1823 DF, p-value: < 2.2e-16
# => ensemble used downstream (lm term dropped as insignificant, p=0.429; coefficients from the refit):
# -15.493854 + 1.746018*bt_model_2 + 0.378762*rf_model_2 - 0.088279*cart_model_2
# Ensemble model result calculations
val_pred_ensemble <- -15.493854 + 1.746018*val_pred_bt + 0.378762*val_pred_rf - 0.088279*val_pred_cart
SST_ensemble_val <- sum((val_df$n - mean(val_df$n))^2)
SSE_ensemble_val <- sum((val_df$n - val_pred_ensemble)^2)
R_sq_ensemble_val <- 1-(SSE_ensemble_val/SST_ensemble_val)
R_sq_ensemble_val #0.9139694
## [1] 0.9128438
train_val_pred_ensemble <- -15.493854 + 1.746018*predict(bt_model_2, newdata = as.matrix(train_val_xgb_df)) + 0.378762*predict(rf_model_2, newdata = train_val_df) - 0.088279*predict(cart_model_2, newdata = train_val_df)
SST_ensemble_train_val <- sum((train_val_df$n - mean(train_val_df$n))^2)
SSE_ensemble_train_val <- sum((train_val_df$n - train_val_pred_ensemble)^2)
R_sq_ensemble_train_val <- 1-(SSE_ensemble_train_val/SST_ensemble_train_val)
R_sq_ensemble_train_val #0.9624101
## [1] 0.961873
ensemble_plot_data <- data.frame(n_ensemble = as.numeric(train_val_pred_ensemble), n=train_val_df$n, yr_mo=train_val_df$yr_mo)
Visualisation
plot_ly(x=ensemble_plot_data$yr_mo, y=(ensemble_plot_data$n_ensemble - ensemble_plot_data$n), type='scatter', mode='markers') %>%
layout(title = 'Plot of Ensemble Model Residuals vs yr_mo for Train_Val Data',
xaxis = list(title = 'yr_mo'),
yaxis = list(title = 'Ensemble Residuals', rangemode = "tozero"))
Test Metrics
test_pred_ensemble <- -15.493854 + 1.746018*predict(bt_model_2, newdata = as.matrix(test_xgb_df)) + 0.378762*predict(rf_model_2, newdata = test_df) - 0.088279*predict(cart_model_2, newdata = test_df)
SST_ensemble_test <- sum((test_df$n - mean(test_df$n))^2)
SSE_ensemble_test <- sum((test_df$n - test_pred_ensemble)^2)
R_sq_ensemble_test <- 1-(SSE_ensemble_test/SST_ensemble_test)
R_sq_ensemble_test #0.8646812
## [1] 0.87298
Our ensemble model fit on the training + validation data had an R-squared
value of 0.96, while its performance on the test data dropped the
R-squared to 0.86.
Model comparison
Each model’s performance can be compared in the following interactive
plot. The red line represents a perfect model. Notice how much better
the ensemble model does than the other individual models.
Test Data
plot_ly(x=test_df$n, y=(predict(cart_model_2, newdata = test_df)), type='scatter', mode='markers', name='CART', marker=list(color='#1f77b4', opacity=0.5)) %>%
add_trace(x=test_df$n, y=(predict(rf_model_2, newdata = test_df)), type='scatter', mode='markers', name='RF', marker=list(color='#ff7f0e', opacity=0.5)) %>%
add_trace(x=test_df$n, y=(predict(lm_model_2, newdata = test_df)), type='scatter', mode='markers', name='LM', marker=list(color='#2ca02c', opacity=0.5)) %>%
add_trace(x=c(0,350), y=c(0,350), type='scatter', mode='lines+markers', name='Perfect', alpha=1, line=list(color='#d62728'), marker=list(color='#2ca02c', opacity=0)) %>%
add_trace(x=test_df$n, y=(predict(bt_model_2, newdata = as.matrix(test_xgb_df))), type='scatter', mode='markers', name='BT', marker=list(color='#9467bd', opacity=0.5)) %>%
add_trace(x=test_df$n, y=(test_pred_ensemble), type='scatter', mode='markers', name='Ensemble', marker=list(color='#8c564b', opacity=0.5)) %>%
layout(title = 'Plot of All Model Predictions vs Actual Values - Test Data',
xaxis = list(title = 'Actual Value', range=c(0,375)),
yaxis = list(title = 'Model Prediction', range=c(-200,350)))
Prediction
We continue with the assumption that our repair shop operates under
perfect competition, meaning it holds an equal share of demand,
i.e. 0.1408%. With this assumption in hand, we estimate the shop's
future monthly car volume.
Computation
First, we formally compute the market share of the shop.
body_shop_count <- 710
Shop_Market_Share <- 0.1408/100
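The equal-share arithmetic is worth making explicit; below is a quick sanity check of the 0.1408% figure, sketched in Python purely for illustration (the 10,700-collision month is a hypothetical example value):

```python
# Equal-share assumption: each of the 710 NYC body shops
# captures 1/710 of the collision-repair demand.
body_shop_count = 710
shop_market_share = 1 / body_shop_count  # ~0.001408, i.e. 0.1408%

# e.g. for a hypothetical month with ~10,700 collisions:
monthly_collisions = 10_700
expected_volume = shop_market_share * monthly_collisions

print(round(shop_market_share * 100, 4))  # 0.1408
print(round(expected_volume))             # 15
```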
Now, we use the test data that we had set aside to predict the
monthly collision volume.
test_agg_df <- (test_df %>%
group_by(MONTH) %>%
summarise(Actual_NYC_Collisions=sum(n),
Actual_Shop_Volume=round(Shop_Market_Share*sum(n), 0)))
Here, we create a dataset for the future year (2021).
temp1 <- data.frame(TIMESTEP = c(48:59))
temp2 <- data.frame(MONTH = c('01','02','03','04','05','06','07','08','09','10','11','12'))
temp3 <- data.frame(PERSON_AGE = c(15:100))
temp4 <- data.frame(PERSON_SEX = c('M','F'))
monthly_predictions_df <- temp1 %>%
bind_cols(temp2) %>%
full_join(temp3, by=character()) %>%
full_join(temp4, by=character())
And finally, we make predictions.
monthly_predictions_xgb_df <- (monthly_predictions_df %>%
select(TIMESTEP,MONTH,PERSON_AGE,PERSON_SEX) %>%
mutate(MONTH=as.integer(as.factor(MONTH)),
PERSON_SEX=ifelse(PERSON_SEX=='M',0,1)))
#Create predictions
monthly_predictions_df$rf <- predict(rf_model_2, newdata = monthly_predictions_df)
monthly_predictions_df$cart <- predict(cart_model_2, newdata = monthly_predictions_df)
monthly_predictions_df$bt <- predict(bt_model_2, newdata = as.matrix(monthly_predictions_xgb_df))
monthly_predictions_df$ensemble <- (-15.493854
+ 1.746018*monthly_predictions_df$bt
+ 0.378762*monthly_predictions_df$rf
- 0.088279*monthly_predictions_df$cart)
monthly_predictions_agg_df <- (monthly_predictions_df %>%
group_by(MONTH) %>%
summarise(Predicted_NYC_Collisions=round(sum(ensemble), 0)) %>%
mutate(Predicted_Shop_Volume = round(Shop_Market_Share*Predicted_NYC_Collisions, 0)) %>%
left_join(test_agg_df, by='MONTH') %>%
mutate(YEAR = 2021) %>%
select(YEAR, MONTH, Actual_NYC_Collisions, Actual_Shop_Volume, Predicted_NYC_Collisions, Predicted_Shop_Volume)
)
kable(monthly_predictions_agg_df) %>%
kable_classic(full_width = FALSE, html_font = "Cambria")
| YEAR | MONTH | Actual_NYC_Collisions | Actual_Shop_Volume | Predicted_NYC_Collisions | Predicted_Shop_Volume |
|------|-------|-----------------------|--------------------|--------------------------|-----------------------|
| 2021 | 01 | 9671 | 14 | 11472 | 16 |
| 2021 | 02 | 8432 | 12 | 11079 | 16 |
| 2021 | 03 | 10648 | 15 | 10953 | 15 |
| 2021 | 04 | 11501 | 16 | 10011 | 14 |
| 2021 | 05 | 13394 | 19 | 10761 | 15 |
| 2021 | 06 | 13745 | 19 | 10605 | 15 |
| 2021 | 07 | 12645 | 18 | 10742 | 15 |
| 2021 | 08 | 12473 | 18 | 10874 | 15 |
| 2021 | 09 | 12623 | 18 | 10874 | 15 |
| 2021 | 10 | 13097 | 18 | 11070 | 16 |
| 2021 | 11 | 11522 | 16 | 10745 | 15 |
| 2021 | 12 | NA | NA | 10589 | 15 |
Plot of actual versus predicted demand volume
#Plot of Actual_Shop_Volume and Predicted_Shop_Volume
plot_ly(x=monthly_predictions_agg_df$MONTH, y=monthly_predictions_agg_df$Actual_Shop_Volume,
type='bar', name='Actual_Shop_Volume') %>%
add_trace(x=monthly_predictions_agg_df$MONTH, y=monthly_predictions_agg_df$Predicted_Shop_Volume, type='bar', name='Predicted_Shop_Volume') %>%
layout(title = 'Plot of Actual_Shop_Volume and Predicted_Shop_Volume (Test Set)',
xaxis = list(title = 'Months in 2021'),
yaxis = list(title = 'Car volume'))
Primary research question response
Recall that our primary research question is to "estimate the monthly
demand for a fictitious auto shop in New York state by extrapolating the
monthly car crash volume with reasonable accuracy". We can see that our
ensemble model does a very good job of predicting the shop volume
compared to the actual volume the shop saw in 2021. We hypothesize that
our model would have performed even better without the unforeseen
consequences of COVID-19 and the subsequent increase in remote work.
Our next task is to address our secondary research objectives:
identifying key demographics for use in a marketing campaign and then
measuring its cost and effect.
Proportion of crashes by gender
kable(train_df %>%
group_by(PERSON_SEX) %>%
summarise(n = sum(n)) %>%
arrange(PERSON_SEX)) %>%
kable_classic(full_width = FALSE, html_font = "Cambria")
| PERSON_SEX | n |
|------------|--------|
| F | 275792 |
| M | 779863 |
Proportion of crashes by age group
kable(train_df %>%
mutate(age_group = 5*floor(PERSON_AGE/5)) %>%
group_by(age_group) %>%
summarise(n = sum(n)) %>%
arrange(age_group)) %>%
kable_classic(full_width = FALSE, html_font = "Cambria")
| age_group | n |
|-----------|--------|
| 15 | 22473 |
| 20 | 97546 |
| 25 | 139152 |
| 30 | 134433 |
| 35 | 119523 |
| 40 | 106615 |
| 45 | 102786 |
| 50 | 98369 |
| 55 | 87184 |
| 60 | 64833 |
| 65 | 39100 |
| 70 | 22739 |
| 75 | 11642 |
| 80 | 5883 |
| 85 | 2416 |
| 90 | 763 |
| 95 | 180 |
| 100 | 18 |
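The `age_group` values above come from flooring each age to the lower bound of its 5-year bucket (`5*floor(PERSON_AGE/5)` in the chunk above); an equivalent sketch in Python, with a hypothetical helper name:

```python
import math

def age_group(age: int) -> int:
    """Bin an age into 5-year buckets labelled by the lower bound."""
    return 5 * math.floor(age / 5)

# Ages 25-29 all land in the '25' bucket, mirroring the table above.
print([age_group(a) for a in (25, 29, 30, 100)])  # [25, 25, 30, 100]
```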
Marketing Campaign
With the assumption that the auto body repair shop has a 0.14% share
of the NYC business, we set a goal to increase it to about 0.25%. To do
so, we considered a few options and settled on a marketing campaign
focused on digital search advertising.
We started by pulling together the monthly numbers from 2021 and
added a new column for the revenue we expect to generate. Based on
information from the 2017 Shop Performance Survey
(Bean), the average repair in the United States costs $200-$399;
however, we chose $2,400 as the average repair order because collision
repairs cost significantly more. The same source also listed the
average gross and net profit margins (49% and 14%, respectively) for
similar shops in the United States. Below is a table showing the
calculations.
#Average Repair Shop Financial Metrics
average_repair_order <- 2400 #Survey average is $200-$399; collision repairs run higher, so we use $2,400
gross_profit_margin <- .49
net_profit_margin <- .14
marketing_campaign_df <- data.frame(monthly_predictions_agg_df)
marketing_campaign_df <- marketing_campaign_df %>%
mutate(monthly_revenue = marketing_campaign_df$Predicted_Shop_Volume* average_repair_order)
marketing_campaign_df %>%
kbl() %>%
kable_paper("hover", full_width = F, html_font = "Cambria")
| YEAR | MONTH | Actual_NYC_Collisions | Actual_Shop_Volume | Predicted_NYC_Collisions | Predicted_Shop_Volume | monthly_revenue |
|------|-------|-----------------------|--------------------|--------------------------|-----------------------|-----------------|
| 2021 | 01 | 9671 | 14 | 11472 | 16 | 38400 |
| 2021 | 02 | 8432 | 12 | 11079 | 16 | 38400 |
| 2021 | 03 | 10648 | 15 | 10953 | 15 | 36000 |
| 2021 | 04 | 11501 | 16 | 10011 | 14 | 33600 |
| 2021 | 05 | 13394 | 19 | 10761 | 15 | 36000 |
| 2021 | 06 | 13745 | 19 | 10605 | 15 | 36000 |
| 2021 | 07 | 12645 | 18 | 10742 | 15 | 36000 |
| 2021 | 08 | 12473 | 18 | 10874 | 15 | 36000 |
| 2021 | 09 | 12623 | 18 | 10874 | 15 | 36000 |
| 2021 | 10 | 13097 | 18 | 11070 | 16 | 38400 |
| 2021 | 11 | 11522 | 16 | 10745 | 15 | 36000 |
| 2021 | 12 | NA | NA | 10589 | 15 | 36000 |
We also obtained digital advertising data from the 2021 Paid
Search Advertising Benchmarks for Every Industry article by
Kristen McCormick (McCormick). We noticed that both the click-through
rate and the conversion rate, 5.39% and 15.23% respectively, were very
good and presented a real opportunity to achieve our goal of increasing
market share. For comparison, we ran the numbers at three levels of
monthly spend ($200, $300, and $400), summarized in the table below.
# 2021 Paid Search Industry Benchmarks (from Kristen McCormick article)
CTR <- 0.0539 #Average click-through rate 5.39%
CPC <- 3.19 #Average cost per click
conversion_rate <- 0.1523 #Average conversion rate is 15.23%
cost_per_lead <- 17.81
#Create a function that will create a column for each level of spend
create_col <- function(spend) {
clicks <- spend/CPC
conversion <- clicks*conversion_rate
revenue <- conversion*average_repair_order
ROAS <- revenue/spend
market_share_lif <- conversion / marketing_campaign_df$Predicted_NYC_Collisions[12] * 100
grs_prf <- revenue * gross_profit_margin
net_prft <- revenue * net_profit_margin
ret_on_inv <- net_prft/spend
round(c(clicks, conversion, revenue, ROAS, market_share_lif, grs_prf, net_prft, ret_on_inv),2)
}
#create and display a dataframe for the three levels of spend
kable(data.frame(Campaign_KPIs = c('Clicks' , 'Conversion' , 'Revenue' , 'ROAS' ,
'Market Share Lift' , 'Gross Profit' , 'Net Profit' , 'ROI'),
'USD200' = create_col(200),
'USD300' = create_col(300),
'USD400' = create_col(400))) %>%
kable_paper("hover", full_width = F, html_font = "Cambria") %>%
column_spec(3, color =c("white"), background = "green")
| Campaign_KPIs | USD200 | USD300 | USD400 |
|-------------------|----------|----------|----------|
| Clicks | 62.70 | 94.04 | 125.39 |
| Conversion | 9.55 | 14.32 | 19.10 |
| Revenue | 22916.61 | 34374.92 | 45833.23 |
| ROAS | 114.58 | 114.58 | 114.58 |
| Market Share Lift | 0.09 | 0.14 | 0.18 |
| Gross Profit | 11229.14 | 16843.71 | 22458.28 |
| Net Profit | 3208.33 | 4812.49 | 6416.65 |
| ROI | 16.04 | 16.04 | 16.04 |
From the summary, it appears that spending $300 per month on digital
advertising will allow us to get very close to our goal of 0.25% of the
market share.
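Because every step in this funnel is linear in spend, ROAS and ROI are identical across the three spend levels. The $300 column can be re-derived in a few lines, sketched in Python for illustration using the benchmark rates quoted above:

```python
# Re-derive the USD300 column of the KPI table; the rates are the
# industry benchmarks quoted above. The linear funnel explains why
# ROAS and ROI do not vary with spend.
CTR, CPC, CONV_RATE = 0.0539, 3.19, 0.1523
AVG_ORDER, NET_MARGIN = 2400, 0.14

spend = 300
clicks = spend / CPC                 # ~94.04
conversions = clicks * CONV_RATE     # ~14.32
revenue = conversions * AVG_ORDER    # ~34374.92
roas = revenue / spend               # ~114.58 -- independent of spend
roi = revenue * NET_MARGIN / spend   # ~16.04  -- likewise constant

print(round(clicks, 2), round(revenue, 2), round(roas, 2), round(roi, 2))
```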
Effects of COVID-19 on NYC Vehicle Collisions
The first reported case of COVID-19 was confirmed in New York on
March 1, 2020, while later reports suggest that the first case could
have been as early as January. Researchers later estimated that at the
time the first case was confirmed, as many as 10,700 New Yorkers had
already contracted the virus (Lewis). By
March 9, there were 16 confirmed cases in NYC alone and on March 16,
Governor Andrew Cuomo announced public schools would be closed,
effectively beginning the COVID-19 lockdown. This remained in effect
until June 8, when the governor announced the first phase of reopening
under safety protocols. By the end of 2020, researchers estimate that
44% of all metro New York residents had been infected. To date, over
35,000 deaths of New York citizens have been attributed to COVID-19,
with at least another 5,000 listed as probable deaths (Health).
#filter data, create combined_df2
combined_df2 <- (crashes_df %>%
select(-c(CRASH_DATE, CRASH.TIME, CONTRIBUTING.FACTOR.VEHICLE.1, CONTRIBUTING.FACTOR.VEHICLE.2,
CONTRIBUTING.FACTOR.VEHICLE.3, CONTRIBUTING.FACTOR.VEHICLE.4, CONTRIBUTING.FACTOR.VEHICLE.5,
VEHICLE.TYPE.CODE.4, VEHICLE.TYPE.CODE.5,
NUMBER.OF.PEDESTRIANS.INJURED, NUMBER.OF.PEDESTRIANS.KILLED,
NUMBER.OF.CYCLIST.INJURED, NUMBER.OF.CYCLIST.KILLED,
ON.STREET.NAME, CROSS.STREET.NAME, OFF.STREET.NAME)) %>%
inner_join(vehicles_df %>%
filter(!is.na(VEHICLE_ID)) %>%
select(-c(CRASH_DATE, CRASH_TIME, VEHICLE_ID))
, by = 'COLLISION_ID') %>%
inner_join(person_df %>%
select(-c(UNIQUE_ID, CONTRIBUTING_FACTOR_1, CONTRIBUTING_FACTOR_2))
, by= c('COLLISION_ID'='COLLISION_ID', 'UNIQUE_ID'='VEHICLE_ID')) %>%
filter((PED_ROLE %in% c('Driver'))) %>% #drivers only
filter((CRASH_DATE >= '2019-01-01') & (CRASH_DATE < '2020-12-31')) %>%
#only from 2019-01-01 through 2020-12-30
filter(PERSON_AGE > 14 & PERSON_AGE < 101) %>%
filter(PERSON_SEX == 'F' | PERSON_SEX == 'M') %>%
mutate(DAY = substr(CRASH_DATE,9,10)) %>%
mutate(TIMESTEP2 = as.period(interval(as.Date('2019-01-01'), CRASH_DATE)) %/% days(1)) %>%
####add TIMESTEP2
mutate(yr_mo_day = paste0(substr(CRASH_DATE,1,4),'-',substr(CRASH_DATE,6,7),'-',substr(CRASH_DATE,9,10))) %>%
######add yr_mo_day
select(COLLISION_ID, CRASH_DATE, DAY, TIMESTEP2, yr_mo_day, NUMBER.OF.PERSONS.KILLED,
PERSON_AGE, PERSON_SEX) %>%
####ADD KILLED, DAY, TIMESTEP2, yr_mo_day
arrange(CRASH_DATE)
)
kable(t(summary(combined_df2))) %>% kable_classic(full_width = TRUE, html_font = "Cambria", font_size = 14)
| Variable | | | | | | |
|--------------------------|------------------|------------------|-------------------|-------------------|-------------------|------------------|
| COLLISION_ID | Min. :3822228 | 1st Qu.:4133700 | Median :4211392 | Mean :4213024 | 3rd Qu.:4289851 | Max. :4479833 |
| CRASH_DATE | Min. :2019-01-01 | 1st Qu.:2019-05-16 | Median :2019-09-21 | Mean :2019-10-19 | 3rd Qu.:2020-02-15 | Max. :2020-12-30 |
| DAY | Length:482737 | Class :character | Mode :character | NA | NA | NA |
| TIMESTEP2 | Min. : 0.0 | 1st Qu.:136.0 | Median :263.0 | Mean :291.5 | 3rd Qu.:409.0 | Max. :729.0 |
| yr_mo_day | Length:482737 | Class :character | Mode :character | NA | NA | NA |
| NUMBER.OF.PERSONS.KILLED | Min. :0.000000 | 1st Qu.:0.000000 | Median :0.000000 | Mean :0.001483 | 3rd Qu.:0.000000 | Max. :4.000000 |
| PERSON_AGE | Min. : 15.00 | 1st Qu.: 29.00 | Median : 40.00 | Mean : 41.65 | 3rd Qu.: 53.00 | Max. :100.00 |
| PERSON_SEX | Length:482737 | Class :character | Mode :character | NA | NA | NA |
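The `TIMESTEP2` field above is simply the number of whole days elapsed since 2019-01-01 (note the summary max of 729 for 2020-12-30, which spans the 2020 leap year). The same index in Python, with a hypothetical helper name:

```python
from datetime import date

def timestep2(crash_date: date, origin: date = date(2019, 1, 1)) -> int:
    """Whole days elapsed since the 2019-01-01 origin, as in combined_df2."""
    return (crash_date - origin).days

print(timestep2(date(2019, 1, 1)))    # 0
print(timestep2(date(2020, 12, 30)))  # 729, matching the summary max
```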
Explore Crash Data by Day, 2019-2020
daily_agg1 <- (combined_df2 %>%
group_by(TIMESTEP2, yr_mo_day, DAY) %>%
summarise(n = n(), .groups = 'drop') %>%
arrange(TIMESTEP2, yr_mo_day, DAY)
)
plot_ly(x=daily_agg1$yr_mo_day, y=daily_agg1$n, type='bar') %>%
layout(title = "Crash Data by Day from 2019-2020", xaxis = list(title = 'Day'), yaxis = list(title = 'Number of Crashes'))
#create covid_df for COVID analysis
covid_df <- (combined_df2 %>%
filter((CRASH_DATE >= '2019-01-01') & (CRASH_DATE < '2020-12-31')) %>% #only from 2019-01-01 to 2020-12-31))
mutate(covid = ifelse(CRASH_DATE >= '2020-03-16', 1, 0)) %>%
rename(fatalities = NUMBER.OF.PERSONS.KILLED) %>%
group_by(TIMESTEP2, yr_mo_day, DAY, covid, fatalities) %>%
summarise(n = n(), .groups = 'drop') %>%
arrange(TIMESTEP2, yr_mo_day, DAY, covid, fatalities)
)
head(covid_df)
Linear Regression with Indicator Variable
\[
\text{Crashes} = \beta_0 + \beta_1 \cdot \text{Covid}
\]
We used COVID as an indicator variable and the number of daily
crashes in NYC as the response variable. The base case when the variable
covid=0 represents January 1, 2019 - March 15, 2020. When covid=1, this
represents March 16, 2020 - Dec 31, 2020.
#linear regression with only covid indicator variable
covid_model1 <- lm(n ~ covid, data = covid_df)
summary(covid_model1)
##
## Call:
## lm(formula = n ~ covid, data = covid_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -600.8 -230.8 102.7 245.5 708.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 601.80 13.77 43.71 <2e-16 ***
## covid -369.04 21.73 -16.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 347.5 on 1062 degrees of freedom
## Multiple R-squared: 0.2135, Adjusted R-squared: 0.2128
## F-statistic: 288.3 on 1 and 1062 DF, p-value: < 2.2e-16
Both the intercept and the β1 coefficient are statistically significant.
We can conclude from this model that before COVID, there was an average
of 602 crashes per day in NYC; after COVID began, this dropped to 233
crashes per day, a large and significant 61.3% decrease in overall
crashes.
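The arithmetic behind that interpretation follows directly from the fitted coefficients in the regression output above, sketched here in Python:

```python
# Daily-crash interpretation of the indicator-variable regression:
# intercept = pre-COVID mean; intercept + coefficient = post-COVID mean.
intercept, covid_coef = 601.80, -369.04

pre_covid_mean = intercept                # ~602 crashes/day
post_covid_mean = intercept + covid_coef  # ~233 crashes/day
pct_drop = -covid_coef / intercept * 100  # ~61.3% decrease

print(round(post_covid_mean), round(pct_drop, 1))  # 233 61.3
```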
According to the National Highway Traffic Safety Administration of the
US Department of Transportation, there are 13.6 million vehicle
collisions each year in the United States, and the average cost of
vehicle repair per collision is $21,036 (Barnes et al.). Our research
also indicates that 22 percent of collision claims involve vehicles
that are totaled (Vallet). We assume that vehicles not totaled
represent potential business for our auto body repair center. We then
conclude that the average number of accidents per day, excluding
totaled vehicles, from 2019 through the beginning of COVID-19 was 470,
while the average from COVID-19 through the end of 2020 was 182.
We estimate $6,054,581 per day in lost revenue for auto body shops in
NYC post-COVID. Assuming a 0.14% market share for our fictitious auto
body shop, its lost revenue totals $8,476 a day. Furthermore, we
estimate that the total revenue lost during the NYC COVID-19 lockdown,
from March through June 2020, is $712,019.
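These totals can be reconstructed approximately from the figures above, sketched in Python for illustration. Small differences from the quoted dollar amounts come from rounding the intermediate daily averages, and the 84-day lockdown length is our assumption for March 16 - June 8, 2020:

```python
# Approximate reconstruction of the lost-revenue estimate.
# 78% of collisions are repairable (22% totaled); $21,036 average
# repair cost per collision; ~84 lockdown days (Mar 16 - Jun 8, 2020).
repairable_rate = 1 - 0.22
avg_repair_cost = 21_036
pre_daily = 601.80 * repairable_rate    # ~470 repairable crashes/day
post_daily = 232.76 * repairable_rate   # ~182 repairable crashes/day

nyc_daily_loss = (pre_daily - post_daily) * avg_repair_cost  # ~$6.05M/day
shop_daily_loss = nyc_daily_loss * 0.0014                    # ~$8,477/day
lockdown_loss = shop_daily_loss * 84                         # ~$712K total

print(round(nyc_daily_loss), round(shop_daily_loss), round(lockdown_loss))
```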
Future work
To build on this work, one could perform further data analysis to
- Understand differences between traffic collisions that involve
cyclists and those that do not. Research by Li and Zhao points to
significant differences between cyclists' and drivers' experiences of
vehicle collisions, with cyclists experiencing three times as many
collisions with fatal outcomes as drivers.
- Determine the features of vehicular and cycling collision hot
spots.
- Apply the sophisticated traffic-exposure formulae from Regev et
al. (133) to new data sets to see whether they hold in other places.
- Perform more data exploration through advanced visualizations to
pull out and analyze any other trends in the data.
Works Cited
Barnes, Stephen R., et al. "COVID-19 Lockdown and Traffic Accidents:
Lessons from the Pandemic." *Contemporary Economic Policy* 40.2 (2022):
349–368. <https://onlinelibrary.wiley.com/doi/abs/10.1111/coep.12562>.
Bean, Travis. "The 2017 Shop Performance Survey." 1 Nov. 2017. 16 Jul.
2022. <https://www.ratchetandwrench.com/articles/5258-the-2017-shop-performance-survey>.
Health, NYC. "COVID-19: Data – NYC Health." 1 Apr. 2020. 17 Jul. 2022.
<https://www1.nyc.gov/site/doh/covid/covid-19-data.page>.
Lam, Lawrence T. "Distractions and the Risk of Car Crash Injury: The
Effect of Drivers' Age." *Journal of Safety Research* 33.3 (2002):
411–419. <https://www.sciencedirect.com/science/article/pii/S0022437502000348>.
Li, Jintai, and Zhan Zhao. "Accident Analysis and Prevention." 167
(2022): n. pag. <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8806026/>.
McCormick, Kristen. "2021 Paid Search Advertising Benchmarks for Every
Industry." 20 May 2022. 16 Jul. 2022.
<https://www.wordstream.com/blog/ws/2021/10/13/search-advertising-benchmarks>.
Regev, Shirley, Jonathan J. Rolison, and Salissou Moutari. "Crash Risk
by Driver Age, Gender, and Time of Day Using a New Exposure
Methodology." *Journal of Safety Research* 66 (2018): 131–140.
<https://www.sciencedirect.com/science/article/pii/S0022437517307600>.
Thompson, Karl. "What Is Normal?" 3 Sep. 2018. 4 Jul. 2022.
<https://revisesociology.com/2018/09/03/what-is-normal/>.
Vallet, Mark. "Total-Loss Thresholds by State." 16 Sep. 2021. 17 Jul.
2022. <https://www.carinsurance.com/Articles/total-loss-thresholds.aspx>.
Volovich, Kristina. "What's a Good Clickthrough Rate? New Benchmark
Data for Google AdWords." 4 Jul. 2022.
<https://blog.hubspot.com/agency/google-adwords-benchmark-data>.
Williams, Allan F. "Teenage Drivers: Patterns of Risk." *Journal of
Safety Research* 34.1 (2003): 5–15.
<https://www.sciencedirect.com/science/article/pii/S0022437502000750>.